Glossary of Big Data, MapReduce, Spark

Compiled by: Mahmoud Parsian

Last updated: 1/21/2023

This glossary is written for my students taking Big Data Modeling & Analytics at Santa Clara University.

Introduction

Big data is a vast and complex field that is constantly evolving, and for that reason, it’s important to understand the basic common terms and the more technical vocabulary so that your understanding can evolve with it.

The big data environment involves many tools and technologies:

The purpose of this glossary is to shed some light on the fundamental definitions of big data, MapReduce, and Spark. This document is a list of terms, words, and concepts found in or relating to big data, MapReduce, and Spark.

Algorithm

Typically, an algorithm is implemented using a programming language such as Python, Java, SQL, ...

In the big data world, an algorithm can be implemented using a compute engine such as MapReduce or Spark.

In The Art of Computer Programming, a famous computer scientist, Donald E. Knuth, defines an algorithm as a set of steps, or rules, with five basic properties:

1) Finiteness. An algorithm must always terminate after a finite number of steps.

2) Definiteness. Each step of an algorithm must be precisely defined.

3) Input. An algorithm has zero or more inputs.

4) Output. An algorithm has one or more outputs.

5) Effectiveness. An algorithm is also generally expected to be effective.
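
These five properties can be seen in a concrete algorithm. The following minimal Python sketch of Euclid's greatest-common-divisor algorithm (a classic example from Knuth's book) annotates how each property applies:

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)."""
    # input: two non-negative integers (property 3)
    while b != 0:          # finiteness: b strictly decreases, so the loop ends (1)
        a, b = b, a % b    # definiteness: each step is precisely defined (2)
    return a               # output: exactly one result (4)
    # effectiveness (5): each step is a basic arithmetic operation

print(gcd(48, 36))  # 12
```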

Distributed algorithm

A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors. Distributed algorithms are used in different application areas of distributed computing, such as DNA analysis, telecommunications, scientific computing, distributed information processing, and real-time process control. Standard problems solved by distributed algorithms include leader election, consensus, distributed search, spanning tree generation, mutual exclusion, finding association of genes in DNA, and resource allocation. Distributed algorithms run in parallel/concurrent environments.

In implementing distributed algorithms, you have to make sure that your aggregations and reductions are semantically correct (since these are executed partition by partition) regardless of the number of partitions for your data. For example, you need to remember that average of an average is not an average.
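
The "average of an average is not an average" pitfall can be demonstrated in a few lines of plain Python (the partition values below are made up for illustration):

```python
# Two partitions of values for the same key
partition_1 = [1, 2]
partition_2 = [3]

# WRONG: averaging the per-partition averages
avg_1 = sum(partition_1) / len(partition_1)   # 1.5
avg_2 = sum(partition_2) / len(partition_2)   # 3.0
avg_of_avgs = (avg_1 + avg_2) / 2             # 2.25 -- depends on the partitioning!

# CORRECT: carry (sum, count) pairs, which CAN be merged partition by partition
total = sum(partition_1) + sum(partition_2)   # 6
count = len(partition_1) + len(partition_2)   # 3
true_avg = total / count                      # 2.0

print(avg_of_avgs, true_avg)  # 2.25 2.0
```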

Examples of systems running distributed algorithms:

Partitioner

A partitioner is a program that distributes the data across the cluster. The types of partitioners are:

For example, a Spark RDD of 480,000,000,000 elements might be partitioned into 60,000 chunks (partitions), where each chunk/partition will have about 8,000,000 elements.

480,000,000,000 = 60,000 x 8,000,000

One of the main reasons for data partitioning is to process many small partitions in parallel (at the same time), which reduces the overall data processing time.

Aggregation

Data Aggregation

Data aggregation refers to the collection of data from multiple sources to bring all the data together into a common repository for the purpose of reporting and/or analysis.

What is Data Aggregation? Data aggregators summarize data from multiple data sources. They provide capabilities for multiple aggregate measurements, such as sum, median, average and counting.
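
As a minimal sketch (the data sources below are hypothetical), Python's standard library can compute these aggregate measurements once data from multiple sources is brought together:

```python
import statistics

# Hypothetical readings collected from two data sources
source_a = [10, 20, 30]
source_b = [40, 50]

# Aggregation: bring the data together, then summarize it
combined = source_a + source_b
summary = {
    "count":   len(combined),
    "sum":     sum(combined),
    "average": statistics.mean(combined),
    "median":  statistics.median(combined),
}
print(summary)
```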

Analytics

What is Data Analytics? Data analytics helps individuals and organizations make sense of data. Data analysts typically analyze raw data for insights, patterns, and trends.

Anonymization

API

Application

Data sizes

~ denotes "about"

Behavioural Analytics

Behavioural Analytics is a kind of analytics that informs about the how, why and what instead of just the who and when. It looks at humanized patterns in the data.

Big Data

Big data is an umbrella term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data-processing applications. In a nutshell, big data refers to data that is so large, fast or complex that it's difficult or impossible to process using traditional methods. Also, big data deals with accessing and storing large amounts of information for analytics.

So, what is Big Data? Big Data is a large data set with increasing volume, variety and velocity.

Big data solutions may have many components (to mention some):

Big Data Platforms/Solutions

Biometrics

The use of data and technology to identify people by one or more of their physical traits (for example, face recognition)

Data Modelling

The analysis of data sets using data modelling techniques to create insights from the data:

There are 5 different types of data models:

Design Patterns

What is a design pattern? In software engineering, a design pattern is a general repeatable solution to a commonly occurring problem in software design. In general, design patterns are categorized mainly into three categories:

Gang of Four (Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides) Design Patterns is the collection of 23 design patterns from the book Design Patterns: Elements of Reusable Object-Oriented Software.

What are data design patterns? Data Design Pattern is a general repeatable solution to a commonly occurring data problem in big data area.

The following are common Data Design Patterns:

The data design patterns can be implemented by MapReduce and Spark and other big data solutions.

Data Set

A collection of (structured, semi-structured, and unstructured) data.

Example of Data Sets:

Data Type

In computer science and computer programming, a data type (or simply type) is a set of possible values and a set of allowed operations on it. A data type tells the compiler or interpreter how the programmer intends to use the data.

For example,

Primitive data type

A data type that allows you to represent a single data value in a single column position. In a nutshell, a primitive data type is either a data type that is built into a programming language, or one that could be characterized as a basic structure for building more sophisticated data types.

Composite data type

In computer science, a composite data type or compound data type is any data type which can be constructed in a program using the programming language's primitive data types.

In Java and Python, custom composite data types can be created by the concept of "class" and objects are created by instantiation of the class objects.
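
For example, here is a minimal Python sketch of a custom composite data type built from primitive types (the Gene class below is a hypothetical example, reusing the gene record fields used later in this document):

```python
# A custom composite data type built from primitives (str, float)
class Gene:
    def __init__(self, gene_id, value_1, value_2):
        self.gene_id = gene_id      # primitive: str
        self.value_1 = value_1      # primitive: float
        self.value_2 = value_2      # primitive: float

    def minimum(self):
        return min(self.value_1, self.value_2)

# an object is created by instantiating the class
g = Gene("INS", 1.1, 1.4)
print(g.gene_id, g.minimum())  # INS 1.1
```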

Apache Hadoop

Hadoop is an open-source framework built to enable the processing and storage of big data across a distributed file system. Hadoop implements the MapReduce paradigm; it is slow and complex and uses disk for read/write operations. Hadoop does not take advantage of in-memory computing. Hadoop runs on a computing cluster.

Hadoop takes care of running your MapReduce code (by map() first, then reduce() logic) across a cluster of machines. Its responsibilities include chunking up the input data, sending it to each machine, running your code on each chunk, checking that the code ran, passing any results either on to further processing stages or to the final output location, performing the sort that occurs between the map and reduce stages and sending each chunk of that sorted data to the right machine, and writing debugging information on each job’s progress, among other things.

Hadoop provides:

What is the difference between Hadoop and RDBMS?

Criteria                  | Hadoop                                                           | RDBMS
--------------------------|------------------------------------------------------------------|---------------------------
Data Types                | Processes semi-structured and unstructured data                  | Processes structured data
Schema                    | Schema on Read                                                   | Schema on Write
Best Fit for Applications | Data discovery; massive storage/processing of unstructured data  | OLTP and ACID transactions
Speed                     | Writes are fast                                                  | Reads are fast
Data Updates              | Write once, read many times                                      | Read/write many times
Data Access               | Batch                                                            | Interactive and batch
Data Size                 | Terabytes to petabytes                                           | Gigabytes to terabytes
Development               | Time-consuming and complex                                       | Simple
API                       | Low level (map() and reduce() functions)                         | SQL, extensive

Replication Factor (RF)

The total number of replicas across the cluster is referred to as the replication factor (RF). A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica.

Given a cluster of N+1 nodes (a master and N worker nodes), if the data replication factor is R, then (R - 1) nodes can safely fail without impacting any running job in the cluster.

What makes Hadoop Fault tolerant?

Hadoop is said to be highly fault tolerant. Hadoop achieves this feat through the process of data replication. Data is replicated across multiple nodes in a Hadoop cluster. The data is associated with a replication factor (RF), which indicates the number of copies of the data that are present across the various nodes in a Hadoop cluster. For example, if the replication factor is 4, the data will be present in four different nodes of the Hadoop cluster, where each node will contain one copy each. In this manner, if there is a failure in any one of the nodes, the data will not be lost, but can be recovered from one of the other nodes which contains copies or replicas of the data.

If replication factor is N, then N-1 nodes can safely fail without impacting a running job.

Big Data Formats

Data comes in many varied formats:

Parquet Files

Apache Parquet is a columnar file format that supports block-level compression and is optimized for query performance, since it allows selection of, say, 10 or fewer columns from records with 50+ columns.

Apache Spark can read/write from/to Parquet data format.

Parquet is an open source columnar storage format that can efficiently store nested data; it is widely used in Hadoop and Spark.

Characteristics of Parquet:

Benefits of Parquet:

Columnar vs. Row Oriented Databases

Columnar databases have become the popular choice for storing analytical data workloads. In a nutshell, column-oriented databases store all values from each column together, whereas row-oriented databases store all the values in a row together.

If you need to read MANY rows but only a FEW columns, then column-oriented databases are the way to go. If you need to read a FEW rows but MANY columns, then row-oriented databases are better suited.
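
A minimal Python sketch (with made-up data, not a real database engine) illustrates the trade-off: a column store reads one contiguous column for an analytical query, while a row store reads one record for a point lookup:

```python
# The same table stored two ways (illustration only)
row_store = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]
column_store = {
    "id":    [1, 2, 3],
    "name":  ["a", "b", "c"],
    "score": [10, 20, 30],
}

# Analytical query (many rows, one column): the column store reads one list
total_score = sum(column_store["score"])   # touches only the "score" column

# Record lookup (one row, many columns): the row store reads one dict
row = row_store[1]                         # touches only the second row

print(total_score, row["name"])  # 60 b
```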

Tez

Apache Tez (which implements the MapReduce paradigm) is a framework for creating high-performance applications for batch and interactive data processing. It coordinates with YARN in Apache Hadoop, which provides the developer framework and API for writing batch-workload applications.

Tez is aimed at building an application framework which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data. It is currently built atop Apache Hadoop YARN.

Apache HBase

Apache HBase is an open source, non-relational, distributed database running in conjunction with Hadoop. HBase can support billions of data points.

Features of HBase:

HDFS

HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. You can place huge amounts of data in HDFS. You can create new files or directories, and you can delete files, but you cannot edit/update files in place.

Features of HDFS:

Commodity server/hardware

Commodity hardware (computer), sometimes known as off-the-shelf server/hardware, is a computer device or IT component that is relatively inexpensive, widely available and basically interchangeable with other hardware of its type. Since commodity hardware is not expensive, it is used in building/creating clusters for big data computing (scale-out architecture). Commodity hardware is often deployed for high availability and disaster recovery purposes.

Fault Tolerance and Data Replication.

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.

Block size can be configured. For example, let the block size be 512MB. Now, let's place a file (sample.txt) of 1800MB in HDFS:

1800MB = 512MB (Block-1) + 512MB (Block-2) + 512MB (Block-3) + 264MB (Block-4)
Let's denote
           Block-1 by B1
           Block-2 by B2
           Block-3 by B3
           Block-4 by B4

Note that the last block has only 264MB of useful data.
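
The block arithmetic above can be sketched as a small Python function (split_into_blocks is a hypothetical helper, not part of HDFS):

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the size of each block for a file; the last block may be smaller."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

# 1800MB file with 512MB blocks: three full blocks plus a 264MB tail block
print(split_into_blocks(1800, 512))  # [512, 512, 512, 264]
```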

Let's say we have a cluster of 6 nodes (one master and 5 worker nodes {W1, W2, W3, W4, W5}, where the master does not store any data), and assume that the replication factor is 2. Then the blocks might be placed as:

    W1: B1, B4
    W2: B2, B3
    W3: B3, B1
    W4: B4 
    W5: B2

Fault Tolerance: if the replication factor is N, then (N-1) nodes can safely fail without causing a job to fail.

High-Performance-Computing (HPC)

Using supercomputers to solve highly complex and advanced computing problems. This is a scale-up architecture and not a scale-out architecture.

Hadoop and Spark use scale-out architectures.

History of MapReduce

MapReduce was developed back in 2004 by Jeffrey Dean and Sanjay Ghemawat of Google (Dean & Ghemawat, 2004). It was presented in their paper, "MapReduce: Simplified Data Processing on Large Clusters," and was inspired by the map() and reduce() functions commonly used in functional programming. At that time, Google's proprietary MapReduce system ran on the Google File System (GFS). Apache Hadoop is an open-source implementation of Google's MapReduce.

MapReduce

MapReduce is a software framework for processing vast amounts of data: a parallel programming model, and an associated implementation, for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

In a nutshell, MapReduce provides 3 functions to analyze huge amounts of data:

The genie/magic of MapReduce is the Sort & Shuffle phase (provided by the MapReduce implementation), which groups the keys generated by all mappers. For example, if all mappers have created the following (key, value) pairs:

(C, 4), (C, 5), 
(A, 2), (A, 3), 
(B, 1), (B, 2), (B, 3), (B, 1), 
(D, 7)

then Sort & Shuffle phase creates the following (key, value) pairs (not in any particular order) to be consumed by reducers:

(A, [2, 3])
(B, [1, 2, 3, 1])
(C, [4, 5])
(D, [7])
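
The Sort & Shuffle grouping above can be simulated in a few lines of Python (sort_and_shuffle is a hypothetical helper, not part of any MapReduce implementation):

```python
from collections import defaultdict

# Simulated mapper output: (key, value) pairs from all mappers
mapper_output = [("C", 4), ("C", 5),
                 ("A", 2), ("A", 3),
                 ("B", 1), ("B", 2), ("B", 3), ("B", 1),
                 ("D", 7)]

def sort_and_shuffle(pairs):
    """Group all values by key, as the Sort & Shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

print(sort_and_shuffle(mapper_output))
# {'C': [4, 5], 'A': [2, 3], 'B': [1, 2, 3, 1], 'D': [7]}
```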

Options for MapReduce implementation:

MapReduce Architecture

Components of MapReduce Architecture:

MapReduce Task:

The MapReduce task is mainly divided into 3 phases, i.e., the Map phase, the Sort & Shuffle phase, and the Reduce phase.

The output of the Sort & Shuffle phase is fed as input to the reducers. In the MapReduce paradigm, Sort & Shuffle is handled by the MapReduce implementation, and it is called the genie of the MapReduce paradigm. A programmer does not write any code for the Sort & Shuffle phase.

For example, for a MapReduce job, if all mappers have created the following (key, value) pairs (with 3 distinct keys {A, B, C}):

    (A, 2), (A, 3)
    (B, 4), (B, 5), (B, 6), (B, 7)
    (C, 8)

Then the Sort & Shuffle phase will produce the following output (which will be sent as input to the reducers -- note that the values are not sorted in any particular order):

    (A, [2, 3])
    (C, [8])
    (B, [7, 4, 5, 6])

For the example listed above, 3 reducers will be executed (in parallel):

    reduce(A, [2, 3])
    reduce(C, [8])
    reduce(B, [7, 4, 5, 6])

where each reducer can generate any number of new (key3, value3) pairs.

What is an Example of a Mapper in MapReduce

Imagine that you have records that describe values for genes, where each record is formatted as:

<gene_id><,><value_1><,><value_2>

Sample records might be:

INS,1.1,1.4
INSR,1.7,1.2

Suppose the goal is to find the median value of the smaller of the two gene values. Therefore, we need to produce (key, value) pairs such that key is a gene_id and value is the minimum of <value_1> and <value_2>.

The following pseudo-code will accomplish the mapper task:

# key: record number or offset of a record number
# key will be ignored since we do not need it
# value: an actual record with the format of:
# <gene_id><,><value_1><,><value_2>
map(key, value) {
   # tokenize input record
   tokens = value.split(",")
   gene_id = tokens[0]
   value_1 = double(tokens[1])
   value_2 = double(tokens[2])
   minimum = min(value_1, value_2)
   # now emit output of the mapper:
   emit(gene_id, minimum)
}

For example, if we had the following input:

INS,1.3,1.5
INS,1.1,1.4
INSR,1.7,1.2
INS,1.6,1.0
INSR,0.7,1.2

Then output of mappers will be:

(INS, 1.3)
(INS, 1.1)
(INSR, 1.2)
(INS, 1.0)
(INSR, 0.7)

Note that, for the preceding mapper output, the Sort & Shuffle phase will produce the following (key, values) pairs to be consumed by the reducers:

(INS, [1.3, 1.1, 1.0])
(INSR, [1.2, 0.7])
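
The mapper pseudo-code above can be turned into runnable Python (here mapper is a plain function simulating one map() call per input record):

```python
def mapper(record):
    """Emit (gene_id, minimum of the two gene values) for one input record."""
    gene_id, value_1, value_2 = record.split(",")
    return (gene_id, min(float(value_1), float(value_2)))

records = ["INS,1.3,1.5", "INS,1.1,1.4", "INSR,1.7,1.2",
           "INS,1.6,1.0", "INSR,0.7,1.2"]

pairs = [mapper(r) for r in records]
print(pairs)
# [('INS', 1.3), ('INS', 1.1), ('INSR', 1.2), ('INS', 1.0), ('INSR', 0.7)]
```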

What is an Example of a Reducer in MapReduce

Imagine that the mappers have produced the following (key, value) output, where key is a gene_id and value is an associated gene value:

(INS, 1.3)
(INS, 1.1)
(INSR, 1.2)
(INS, 1.0)
(INSR, 0.7)

Note that, for the preceding mapper output, the Sort & Shuffle phase will produce the following (key, values) pairs to be consumed by the reducers:

(INS, [1.3, 1.1, 1.0])
(INSR, [1.2, 0.7])

Now, assume that the goal of reducers is to find the median of values per key (as a gene_id). For simplicity, we assume that there exists a median() function, which accepts a list of values and computes the median of given values.

# key: a unique gene_id
# values: Iterable<Double> (i.e., a list of values)
reduce(key, values) {
   median_value = median(values)
   # now output final (key, value)
   emit(key, median_value)
}

Therefore, with this reducer function, the reducers will create the following (key, value) pairs:

(INS, 1.1)
(INSR, 0.95)
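
The reducer pseudo-code above can be turned into runnable Python; here statistics.median plays the role of the assumed median() function:

```python
import statistics

def reducer(key, values):
    """Emit (key, median of the values) for one unique key."""
    return (key, statistics.median(values))

# Simulated Sort & Shuffle output
shuffled = {"INS": [1.3, 1.1, 1.0], "INSR": [1.2, 0.7]}

results = [reducer(k, v) for k, v in shuffled.items()]
print(results)  # [('INS', 1.1), ('INSR', 0.95)]
```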

What is an Example of a Combiner in MapReduce

Consider the classic word count program in MapReduce. Let's consider 3 partitions with the following mapper output:

Partition-1   Partition-2    Partition-3
===========   ===========    ===========
(A, 1)        (A, 1)         (C, 1)
(A, 1)        (B, 1)         (C, 1)
(B, 1)        (B, 1)         (C, 1)
(B, 1)        (C, 1)         (C, 1)
(B, 1)                       (B, 1)

Without a combiner, Sort & Shuffle will output the following (for all partitions):

(A, [1, 1, 1])
(B, [1, 1, 1, 1, 1, 1])
(C, [1, 1, 1, 1, 1])

With a combiner, Sort & Shuffle will output the following (for all partitions):

(A, [2, 1])
(B, [3, 2, 1])
(C, [1, 4])

As you can see, with a combiner, values are combined for the same key on a partition-by-partition basis. In MapReduce, combiners are mini-reducer optimizations; they reduce network traffic by combining many values into a single value per partition.
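
The per-partition combining above can be simulated in plain Python (combine is a hypothetical stand-in for a MapReduce combiner):

```python
from collections import defaultdict

def combine(partition):
    """Mini-reducer: sum the counts for each key within ONE partition."""
    counts = defaultdict(int)
    for key, value in partition:
        counts[key] += value
    return list(counts.items())

partitions = [
    [("A", 1), ("A", 1), ("B", 1), ("B", 1), ("B", 1)],    # Partition-1
    [("A", 1), ("B", 1), ("B", 1), ("C", 1)],              # Partition-2
    [("C", 1), ("C", 1), ("C", 1), ("C", 1), ("B", 1)],    # Partition-3
]

combined = [combine(p) for p in partitions]
print(combined)
# [[('A', 2), ('B', 3)], [('A', 1), ('B', 2), ('C', 1)], [('C', 4), ('B', 1)]]
```

With this combiner, only 7 (key, value) pairs cross the network instead of the original 14.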

Partition

Data can be partitioned into smaller logical units. These units are called partitions. In big data, partitions are used as the unit of parallelism.

For example, in a nutshell, Apache Spark partitions your data and then each partition is processed by an executor.

For example, given a data set of 80,000,000,000 records, this data can be partitioned into 80,000 chunks, where each chunk/partition will have about 1,000,000 records. Then, in a transformation (such as a mapper, filter, ...), these partitions can be processed in parallel. The maximum parallelism for this example is 80,000. If the cluster cannot run 80,000 tasks in parallel, then some of the partitions will be queued.

In MapReduce, input is partitioned and then passed to mappers (so that the mappers can be run in parallel).

In Apache Spark, a programmer can control the partitioning of data (by using coalesce(), ...) and hence control parallelism.

Spark examples:

Parallel computing

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously (at the same time). Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, and parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

MapReduce and Spark employ parallelism by data partitioning.

How does MapReduce work?

A MapReduce system (an implementation of the MapReduce model) is usually composed of three steps (even though it's generalized as the combination of Map and Reduce operations/functions). The MapReduce operations are:

Word Count in MapReduce

Given a set of text documents (as input), the Word Count algorithm finds the frequencies of unique words in the input. The map() and reduce() functions are provided as pseudo-code.

Finding Average in MapReduce

Given a set of gene_id(s) and gene_value(s) (as input), the average algorithm finds the average of gene values per gene_id for cancerous genes. Assume that the input is formatted as:

    <gene_id_as_string><,><gene_value_as_double><,><cancer-or-benign>

    where <cancer-or-benign> has value as {"cancer", "benign"}

The map() and reduce() functions are provided as a pseudo-code.

To have a combiner function, we have to change the output of the mappers (since the avg of an avg is not an avg). The avg function is commutative, but not associative. Changing the output of the mappers (to carry (sum, count) pairs) makes the aggregation both commutative and associative.

Commutative means that:

    avg(a, b) = avg(b, a)

Associative means that:

    avg(avg(a, b), c) = avg(a, avg(b, c))

For details on commutative and associative properties, refer to Data Algorithms with Spark.

What is an Associative Law

An associative operation:

    f: X x X -> X

is a binary operation such that for all a, b, c in X:

    f(a, f(b, c)) = f(f(a, b), c)

For example, + (addition) is an associative function because

    (a + (b + c)) = ((a + b) + c)

For example, * (multiplication) is an associative function because

    (a * (b * c)) = ((a * b) * c)

While - (subtraction) is not an associative function because

    (4 - (6 - 3)) != ((4 - 6) - 3)
          (4 - 3) != (-2 - 3)
                1 != -5

Similarly, the average operation is not an associative function.

FACT: avg(1, 2, 3) = 2

avg(1, avg(2, 3)) != avg(avg(1, 2), 3)
      avg(1, 2.5) != avg(1.5, 3)
             1.75 != 2.25
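
These associativity and commutativity facts can be checked directly in Python:

```python
def avg(a, b):
    """Binary average of two numbers."""
    return (a + b) / 2

# commutative: the order of the arguments does not matter
assert avg(1, 2) == avg(2, 1)

# NOT associative: the grouping of the arguments matters
left  = avg(1, avg(2, 3))   # avg(1, 2.5) = 1.75
right = avg(avg(1, 2), 3)   # avg(1.5, 3) = 2.25
print(left, right)  # 1.75 2.25
```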

What is a Commutative Law

A commutative function f is a function that takes multiple inputs from a set X and produces an output that does not depend on the ordering of the inputs. For example, the binary operation + is commutative, because 2 + 5 = 5 + 2. Function f is commutative if the following property holds:

    f(a, b) = f(b, a)

While - (subtraction) is not a commutative function because

    2 - 4 != 4 - 2
       -2 != 2

Monoid

Monoids are algebraic structures. A monoid M is a triplet (X, f, i), where X is a set, f is a binary operation on X, and i is an identity element in X.

The monoid axioms (which govern the behavior of f) are as follows.

  1. (Closure) For all a, b in X, f(a, b) and f(b, a) are also in X.
  2. (Associativity) For all a, b, c in X:

    f(a, f(b, c)) = f(f(a, b), c)
  3. (Identity) There is an i in X such that, for all a in X:

    f(a, i) = f(i, a) = a

Monoid Examples

Example-1

Let X denote the set of non-negative integers. Then M(X, +, 0) is a monoid, since + is associative and 0 is its identity element.

Example-2

Let S denote a set of strings including an empty string ("") of length zero, and let || denote a concatenation operator.

Then M(S, ||, "") is a monoid.
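
The monoid axioms can be checked mechanically for the string-concatenation example above on a small sample of S (an exhaustive check is impossible, since S is infinite):

```python
import itertools

# Check the monoid axioms for (strings, concatenation, "")
X = ["", "a", "b", "ab"]      # a small sample of the set S
f = lambda s, t: s + t        # the concatenation operator ||
i = ""                        # the identity element: the empty string

for a, b, c in itertools.product(X, repeat=3):
    assert isinstance(f(a, b), str)           # closure
    assert f(a, f(b, c)) == f(f(a, b), c)     # associativity

for a in X:
    assert f(a, i) == f(i, a) == a            # identity

print("the monoid axioms hold on this sample of S")
```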

Non Monoid Examples

Let X denote a set of numbers. Then:

M(X, -, 0) is not a monoid, since the binary subtraction function is not an associative function.

M(X, /, 1) is not a monoid, since the binary division function is not an associative function.

M(X, AVG, 0) is not a monoid, since AVG (an average function) is not an associative function.

Monoids as a Design Principle for Efficient MapReduce Algorithms

According to Jimmy Lin: "it is well known that since the sort/shuffle stage in MapReduce is costly, local aggregation is one important principle to designing efficient algorithms. This short paper represents an attempt to more clearly articulate this design principle in terms of monoids, which generalizes the use of combiners and the in-mapper combining pattern."

For example, in Spark (using PySpark), in a distributed computing environment, we cannot write the following transformation to find the average of integer values per key:

    # rdd: RDD[(String, Integer)] : RDD[(key, value)]
    # The Following Transformation is WRONG
    avg_per_key = rdd.reduceByKey(lambda x, y: (x+y) / 2)

This will not work, because the average of averages is not an average. In Spark, RDD.reduceByKey() merges the values for each key using an associative and commutative reduce function, and the average function is not associative.

How to fix this problem? Make it a Monoid:

    # rdd: RDD[(String, Integer)] : RDD[(key, value)]
    # convert (key, value) into (key, (value, 1))
    # rdd2 elements will be monoid structures under addition (+)
    rdd2 = rdd.mapValues(lambda v: (v, 1))
    # rdd2: RDD[(String, (Integer, Integer))] : RDD[(key, (sum, count))]

    # find (sum, count) per key: a Monoid 
    sum_count_per_key = rdd2.reduceByKey(
      lambda x, y: (x[0]+y[0], x[1]+y[1])
    )

    # find average per key
    # v : (sum, count)
    avg_per_key = sum_count_per_key.mapValues(
       lambda v: float(v[0]) / v[1]
    )

Note that by mapping (key, value) to (key, (value, 1)), we make the addition of (sum, count) pairs a monoid. Consider the following two partitions:

Partition-1        Partition-2
(A, 1)             (A, 3)
(A, 2)

By mapping (key, value) to (key, (value, 1)), we will have (as rdd2):

Partition-1        Partition-2
(A, (1, 1))         (A, (3, 1))
(A, (2, 1))

Then, after the per-partition (map-side) combining, sum_count_per_key will hold:

Partition-1        Partition-2
(A, (3, 2))        (A, (3, 1))

These partial results are then merged across partitions into (A, (6, 3)). Finally, the avg_per_key RDD will produce the final value per key: (A, 2.0).

What Does it Mean that "Average of Average is Not an Average"

In distributed computing environments (such as MapReduce, Hadoop, Spark, ...), the correctness of algorithms is very important. Let's say we have only 2 partitions:

Partition-1        Partition-2
(A, 1)             (A, 3)
(A, 2)

and we want to calculate the average per key. Looking at these partitions, the average of (1, 2, 3) is exactly 2.0. But since we are in a distributed environment, the average will be calculated per partition:

Partition-1: avg(1, 2) = 1.5
Partition-2: avg(3) = 3.0

avg(Partition-1, Partition-2) = (1.5 + 3.0) / 2 = 2.25

===> which is NOT the correct average we were expecting.

To fix this problem, we can change the output of the mappers: the revised output is (key, (sum, count)):

Partition-1        Partition-2
(A, (1, 1))        (A, (3, 1))
(A, (2, 1))

Now, let's calculate the average by first adding the (sum, count) pairs:

Partition-1: (1, 1) + (2, 1) = (1+2, 1+1) = (3, 2)
Partition-2: (3, 1)
merged: (3, 2) + (3, 1) = (6, 3)
average = 6 / 3
        = 2.0
        ===> CORRECT AVERAGE

Advantages of MapReduce

Is there any benefit in using the MapReduce paradigm? With MapReduce, developers do not need to write code for parallelism, distributing data, or other complex coding tasks, because those are already built into the model. This alone shortens analytical programming time.

The following are advantages of MapReduce:

What is a MapReduce Job

Job: A MapReduce job is an execution of a mapper and a reducer across a data set. Minimally, a MapReduce job will have the following components:

Disadvantages of MapReduce

What is MapReduce's Job Flow

1-InputFormat: Splits the input into (key_1, value_1) pairs and passes them to mappers. When Hadoop submits a job, it splits the input data logically (input splits), and each split is processed by a mapper. The number of mappers is equal to the number of input splits created. Hadoop's InputFormat.getSplits() function is responsible for generating the input splits; each split is used as the input for one mapper.

2-Mapper: map(key_1, value_1) emits a set of (key_2, value_2) pairs. If a mapper does not emit any (key, value) pairs, then it means that (key_1, value_1) is filtered out (for example, tossing out the invalid/bad records).

3-Combiner: [optional] combine(key_2, [value_2, ...]) emits (key_2, value_22). The combiner might emit no (key, value) pair if there is a filtering algorithm (based on the key, i.e., key_2, and its associated values).

Note that value_22 is an aggregated value for [value_2, ...].

4-Sort & Shuffle: Groups the mappers' keys with their associated values. If the output of all mappers/combiners is:

    (K_1, v_1), (K_1, v_2), (K_1, v_3), ..., 
    (K_2, t_1), (K_2, t_2), (K_2, t_3), ...,
    ...
    (K_n, a_1), (K_n, a_2), (K_n, a_3), ...

Then the output of Sort & Shuffle will be (which will be fed as input to the reducers as (key, values) pairs):

    (K_1, [v_1, v_2, v_3, ...])
    (K_2, [t_1, t_2, t_3, ...])
    ...
    (K_n, [a_1, a_2, a_3, ...])

5-Reducer: We will have n reducers, since we have n unique keys. All these reducers can run in parallel (if we have enough resources).

reduce(key, values) will emit a set of (key_3, value_3) pairs, and eventually they are written to the output. Note that each reducer key will be one of {K_1, K_2, ..., K_n}.

6-OutputFormat: Responsible for writing (key_3, value_3) pairs to the output medium. Note that some of the reducers might not emit any (key_3, value_3) pairs: this means that the reducer is filtering out some keys based on the associated values (for example, if the median of the values is less than 10, then filter out the key).

Hadoop vs. Spark

Feature                  | Hadoop                                        | Spark
-------------------------|-----------------------------------------------|--------------------------------------------------------------
Data Processing          | Batch processing                              | Batch and stream processing
Memory Usage             | Disk-bound                                    | Uses large amounts of RAM
Security                 | Better security features                      | Basic security
Fault Tolerance          | Uses replication                              | Uses RDDs and various data storage models
Graph Processing         | Must develop custom algorithms                | GraphX library, plus the external GraphFrames library
Ease of Use              | Difficult to use                              | Easier to use
API                      | Low-level API                                 | High-level API
Real-time Processing     | Batch only                                    | Batch, interactive, and stream
Interactive Processing   | Not supported                                 | Supported (PySpark, ...)
Speed                    | SLOW: MapReduce reads/writes from disk        | FAST: keeps intermediate data in memory, reducing disk I/O
Latency                  | High-latency computing framework              | Low latency; can process data interactively
Machine Learning API     | Not supported                                 | Supported (ML library)
Data Source Support      | Limited                                       | Extensive
Storage                  | Has HDFS (Hadoop Distributed File System)     | No storage system of its own; can use S3, HDFS, and many other data sources and storages
MapReduce                | Implements MapReduce                          | Implements a superset of MapReduce and beyond
Join Operation           | Does not support join directly                | Extensive join API

Apache Spark

In a nutshell, we can say that Apache Spark is the most active open big data tool reshaping the big data market. Apache Spark is an engine for large-scale data analytics. Spark is a multi-language (Java, Scala, Python, R, SQL) engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Spark implements a superset of the MapReduce paradigm, uses memory/RAM as much as possible, and can run up to 100 times faster than Hadoop. Spark is considered the successor of Hadoop/MapReduce and has addressed many problems of Hadoop.

Using Spark, developers do not need to write code for parallelism, distributing data, or other complex coding tasks, because those are already built into the Spark engine. This alone shortens analytical programming time.

Apache Spark is one of the best alternatives to Hadoop and is currently the de facto standard for big data analytics. Spark offers a simple API and provides high-level mappers, filters, and reducers.

Spark's architecture consists of two main components: the driver (which runs your application's main logic and creates the SparkContext/SparkSession) and the executors (which run tasks on the worker nodes).

PySpark is an interface for Spark in Python. PySpark has two main data abstractions: the RDD and the DataFrame.

Spark addresses many problems of Hadoop, including slow disk-based processing, high latency, the lack of interactive and streaming modes, and the absence of built-in machine learning and graph libraries.

Apache Spark Components

Apache Spark provides Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).

Apache Spark in Large-Scale Sorting

Spark officially set a new record in large-scale sorting: the Databricks team sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache.

DAG in Spark

Spark's DAG (directed acyclic graph) is a strict generalization of the MapReduce model. DAG-based execution allows better global optimization than systems like MapReduce. The Apache Spark DAG allows a user to dive into a stage and further expand the details of any stage.

A DAG in Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to RDDs. In a Spark DAG, every edge is directed from an earlier RDD to a later one in the sequence. When an action is called, the created DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.

By using Spark Web UI, you can view Spark jobs and their associated DAGs.
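As a toy illustration of the idea (plain Python, not Spark's actual scheduler; the RDD names are made up), a DAG can be modeled as a mapping from each RDD to the RDDs it is computed from, and scheduling amounts to visiting the vertices in topological order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# each entry maps a (hypothetical) RDD to the set of RDDs it depends on;
# an edge u -> v means "v is produced from u by a transformation"
dag = {
    "rdd_filtered": {"rdd_source"},    # filter()
    "rdd_mapped":   {"rdd_filtered"},  # map()
    "rdd_reduced":  {"rdd_mapped"},    # reduceByKey()
}

# a valid execution order: every RDD appears after all of its dependencies
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['rdd_source', 'rdd_filtered', 'rdd_mapped', 'rdd_reduced']
```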

Spark Concepts and Key Terms

Spark as a superset of MapReduce

Spark is a true successor of MapReduce: it maintains MapReduce's linear scalability and fault tolerance, but extends it in 7 important ways:

  1. Spark does not rely on the low-level and rigid map-then-reduce workflow. Spark's engine can execute a more general directed acyclic graph (DAG) of operators. This means that in situations where MapReduce must write intermediate results to the distributed file system (such as HDFS or S3), Spark can pass them directly to the next step in the pipeline. Rather than writing many map-then-reduce jobs, in Spark you can chain transformations in any order to build an optimized solution.

  2. Spark complements its computational capability with a simple and rich set of transformations and actions that enable users to express computation more naturally. A powerful and simple API (as a set of functions) is provided for various tasks including numerical computation, datetime processing, and string manipulation.

  3. Spark extends its predecessors (such as Hadoop) with in-memory processing. MapReduce uses disk I/O (which is slow), while Spark uses in-memory computing as much as possible and can be up to 100 times faster than MapReduce implementations. This means that future steps that want to use the same data set need not recompute it or reload it from disk. Spark is well suited for highly iterative algorithms as well as ad hoc queries.

  4. Spark offers an interactive environment (for example, using PySpark interactively) for testing and debugging data transformations.

  5. Spark offers extensive machine learning libraries (Hadoop/MapReduce does not have this capability).

  6. Spark offers extensive graph API by GraphX (built-in) and GraphFrames (as an external library).

  7. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.

What is a Spark RDD?

Spark's RDD (full name in PySpark: pyspark.RDD) is a Resilient Distributed Dataset, the basic abstraction in Spark. An RDD represents an immutable, partitioned collection of elements that can be operated on in parallel. Basically, an RDD represents your data (as a collection, text files, databases, Parquet files, JSON, CSV files, ...). Once your data is represented as an RDD, you can apply transformations (such as filters, mappers, and reducers) to your RDD and create new RDDs.

An RDD can be created from many data sources such as Python collections, text files, CSV files, JSON, ...

An RDD is more suitable for unstructured and semi-structured data (while a DataFrame is more suitable for structured and semi-structured data).

What are Spark Mappers?

Spark offers comprehensive mapper functions for RDDs and DataFrames.

Mappers for RDDs include map(), flatMap(), mapPartitions(), mapValues(), and flatMapValues().

Mappers for Spark DataFrames can be handled by two means: the DataFrame's native DSL functions (such as select() and withColumn()) and SQL queries against the DataFrame registered as a table.
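The semantic difference between the two most common mappers, map() and flatMap(), can be sketched in plain Python (no Spark needed): map() produces exactly one output element per input element, while flatMap() flattens each returned sequence into individual elements:

```python
from itertools import chain

lines = ["fox jumped", "red fox"]

# like RDD.map(): one output element per input element (a list of token lists)
mapped = [line.split() for line in lines]
# [['fox', 'jumped'], ['red', 'fox']]

# like RDD.flatMap(): each input element may produce 0..N outputs, flattened
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
# ['fox', 'jumped', 'red', 'fox']

print(mapped)
print(flat_mapped)
```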

What are Spark Reducers?

Spark offers comprehensive reducer functions for RDDs and DataFrames.

Reducers for RDDs include reduceByKey(), groupByKey(), combineByKey(), aggregateByKey(), and foldByKey().

Reductions for Spark DataFrames can be handled by two means: the DataFrame's native groupBy() and agg() functions, and SQL queries (with GROUP BY) against the DataFrame registered as a table.
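The semantics of reduceByKey() can be mimicked in a few lines of plain Python with a dictionary (a local sketch of the logic only; Spark performs this merging in parallel across partitions):

```python
def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey() on a local list of (key, value) pairs."""
    merged = {}
    for key, value in pairs:
        # merge the new value into the running result for this key
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

pairs = [("a", 1), ("b", 7), ("a", 2), ("a", 3), ("b", 8)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 6), ('b', 15)]
print(reduce_by_key(pairs, max))                 # [('a', 3), ('b', 8)]
```

The function passed in must be associative and commutative, for the same reason Spark requires it: the values for a key may be merged in any order.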

Difference between Spark's Action and Transformation

A Spark transformation (such as map(), filter(), reduceByKey(), ...) applies to a source RDD and creates a new target RDD, while an action (such as collect(), save(), ...) applies to a source RDD and produces a non-RDD result (such as a number or another data structure).

In Spark, if a function returns a DataFrame, Dataset, or RDD, it is a transformation. If it returns anything else or does not return a value at all (or returns Unit in the case of Scala API), it is an action.

What is Lineage In Spark?

Spark RDDs are immutable (read-only) distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault-tolerant: they track lineage information in order to rebuild lost data automatically on failure.

What are Spark operations/functions?

Two types of Spark RDD operations are: Transformations and Actions.

The Spark Programming Model

Spark programming starts with a data set (which can be represented as an RDD or a DataFrame), usually residing in some form of distributed, persistent storage like Amazon S3 or Hadoop HDFS. Writing a Spark program typically consists of a few related steps:

  1. Define a set of transformations on the input data set.

  2. Invoke actions that output the transformed data sets to persistent storage or return results to the driver’s local memory.

  3. Run local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.

What is Lazy Binding In Spark?

Lazy binding/evaluation in Spark means that the execution of transformations will not start until an action is triggered.

In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is needed (non-strict evaluation) and which also avoids repeated evaluations (sharing).
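Python generators give the same flavor of laziness (a loose analogy only, not Spark itself): building the pipeline below performs no work until a terminal operation, the analog of a Spark action, consumes it:

```python
calls = []

def traced_double(x):
    calls.append(x)   # record that real work happened
    return 2 * x

# "transformation": the generator is built, but nothing is computed yet
pipeline = (traced_double(x) for x in range(5))
assert calls == []    # still no work done

# "action": consuming the pipeline finally triggers the computation
total = sum(pipeline)
print(calls)  # [0, 1, 2, 3, 4]
print(total)  # 20
```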

Difference between reduceByKey() and combineByKey()

RDD.reduceByKey() merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.

This can be expressed as:

    reduceByKey: RDD[(K, V)] --> RDD[(K, V)]

RDD.combineByKey() is a generic function to combine the elements for each key using a custom set of aggregation functions. RDD.combineByKey() turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C.

For combineByKey(), users provide three functions: createCombiner (turns a V into a C), mergeValue (merges a V into a C), and mergeCombiners (combines two C's into a single C).

This can be expressed as:

    combineByKey: RDD[(K, V)] --> RDD[(K, C)]

    where V and C can be the same or different 

What is an example of RDD.combineByKey()?

Combine all values per key.


# combineByKey: RDD[(String, Integer)] --> RDD[(String, [Integer])]

rdd = sc.parallelize([("a", 1), ("b", 7), ("a", 2), ("a", 3), ("b", 8), ("z", 5)])

# V --> C
def to_list(a):
    return [a]

# C x V --> C
def append(a, b):
    a.append(b)
    return a

# C x C --> C
def extend(a, b):
    a.extend(b)
    return a

# rdd: RDD[(String, Integer)]
# rdd2: RDD[(String, [Integer])]
rdd2 = rdd.combineByKey(to_list, append, extend)
rdd2.collect()

[
 ('z', [5]),
 ('a', [1, 2, 3]), 
 ('b', [7, 8])
]

# Note that the values per key do not need to be sorted

What is an example of RDD.reduceByKey()?

Find maximum of values per key.


# reduceByKey: RDD[(String, Integer)] --> RDD[(String, Integer)]

rdd = sc.parallelize([("a", 1), ("b", 7), ("a", 2), ("a", 3), ("b", 8), ("z", 5)])

# rdd: RDD[(String, Integer)]
# rdd2: RDD[(String, Integer)]
rdd2 = rdd.reduceByKey(lambda x, y: max(x, y))
rdd2.collect()

[
 ('z', 5),
 ('a', 3), 
 ('b', 8)
]

What is an example of RDD.groupByKey()?

Combine/Group values per key.


# groupByKey: RDD[(String, Integer)] --> RDD[(String, [Integer])]

rdd = sc.parallelize([("a", 1), ("b", 7), ("a", 2), ("a", 3), ("b", 8), ("z", 5)])

# rdd: RDD[(String, Integer)]
# rdd2: RDD[(String, [Integer])]
# mapValues(list) turns each grouped ResultIterable into a plain Python list
rdd2 = rdd.groupByKey().mapValues(list)
rdd2.collect()

[
 ('z', [5]),
 ('a', [1, 2, 3]), 
 ('b', [7, 8])
]

Difference of RDD.groupByKey() and RDD.reduceByKey()

Both reduceByKey() and groupByKey() are wide transformations, which means both trigger a shuffle operation. The key difference between reduceByKey() and groupByKey() is that reduceByKey() does a map-side combine and groupByKey() does not. Overall, reduceByKey() is optimized with a map-side combine. Note that the reducer function for reduceByKey() must be associative and commutative.
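A rough plain-Python sketch of why the map-side combine matters, counting the records that would cross the shuffle for a per-key count (the two lists stand in for two hypothetical partitions on different workers):

```python
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],  # records held by worker 1
    [("a", 1), ("b", 1), ("b", 1)],  # records held by worker 2
]

# groupByKey(): every (key, value) record is shuffled as-is
shuffled_by_group = sum(len(p) for p in partitions)  # 6 records

# reduceByKey(): combine locally first (map-side combine),
# then shuffle at most one record per key per partition
def local_combine(partition):
    acc = {}
    for key, value in partition:
        acc[key] = acc.get(key, 0) + value
    return list(acc.items())

combined = [local_combine(p) for p in partitions]
shuffled_by_reduce = sum(len(p) for p in combined)   # 4 records

print(shuffled_by_group, shuffled_by_reduce)  # 6 4
```

With many repeated keys per partition, the map-side combine shrinks shuffle traffic dramatically.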

What is a DataFrame?

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet or a relational table. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

Python DataFrame Example

A Pandas DataFrame is a 2-dimensional mutable labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects. It is the most commonly used Pandas object. The number of rows of a Pandas DataFrame is mutable, and its size is limited by the memory of the computer where it resides.

import pandas as pd

data = {
  "calories": [100, 200, 300],
  "duration": [50, 60, 70]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

# Result:

      calories  duration
  0       100        50
  1       200        60
  2       300        70

Spark DataFrame Example

A distributed collection of data grouped into named columns. Spark's DataFrame is immutable and can have billions of rows. A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

# PySpark code:

input_path = "..."
# spark: as a SparkSession object
people = spark.read.parquet(input_path)

Once created, a DataFrame can be manipulated using the various domain-specific-language (DSL) functions, or you may use SQL to execute queries against the DataFrame (registered as a table).

A more concrete example:

# PySpark code:

# To create DataFrame using SparkSession
input_path_people = "..."
people = spark.read.parquet(input_path_people)
input_path_dept = "..."
department = spark.read.parquet(input_path_dept)

result = people.filter(people.age > 30)\
               .join(department, people.deptId == department.id)\
               .groupBy(department.name, "gender")\
               .agg({"salary": "avg", "age": "max"})

What is a Spark DataFrame?

Spark's DataFrame (full name: pyspark.sql.DataFrame) is an immutable and distributed collection of data grouped into named columns. Once your DataFrame is created, it can be manipulated and transformed into another DataFrame by the DataFrame's native API and SQL.

A DataFrame can be created from Python collections, relational databases, Parquet files, JSON, CSV files, ...

A DataFrame is more suitable for structured and semi-structured data (while an RDD is more suitable for unstructured and semi-structured data).

Spark RDD Example

A Spark RDD can represent billions of elements.

    >>> sc
    <SparkContext master=local[*] appName=PySparkShell>
    >>> sc.version
    '3.3.1'
    >>> numbers = sc.parallelize(range(0,1000))
    >>> numbers
    PythonRDD[1] at RDD at PythonRDD.scala:53
    >>> numbers.count()
    1000
    >>> numbers.take(5)
    [0, 1, 2, 3, 4]
    >>> numbers.getNumPartitions()
    16
    >>> total = numbers.reduce(lambda x, y: x+y)
    >>> total
    499500

Spark DataFrame Example

A Spark DataFrame can represent billions of rows of named columns.

    >>> records = [("alex", 23), ("jane", 24), ("mia", 33)]
    >>> spark
    <pyspark.sql.session.SparkSession object at 0x12469e6e0>
    >>> spark.version
    '3.3.1'
    >>> df = spark.createDataFrame(records, ["name", "age"])
    >>> df.show()
    +----+---+
    |name|age|
    +----+---+
    |alex| 23|
    |jane| 24|
    | mia| 33|
    +----+---+
    
    >>> df.printSchema()
    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = true)

Join Operation in MapReduce

The MapReduce paradigm does not have a direct join API, but a join can be implemented as a set of custom mappers and reducers.

Below, an inner join is presented for MapReduce:

Let R be a relation as (K, a1, a2, ...), where K is a key and a1, a2, ... are additional attributes of R; we denote it as (K, A), where A denotes the attributes (a1, a2, ...).

Let S be a relation as (K, b1, b2, ...), where K is a key and b1, b2, ... are additional attributes of S; we denote it as (K, B), where B denotes the attributes (b1, b2, ...).

We want to implement R.join(S), which will return (K, (a, b)) pairs, where (K, a) is in R and (K, b) is in S.

Step-1: Map relation R: inject the name of relation into an output value as:

input (K, a)
output: (K, ("R", a))

Step-2: Map relation S: inject the name of relation into an output value as:

input (K, b)
output: (K, ("S", b))

Step-3: Merge outputs of Step-1 and Step-2 into /tmp/merged_input/, which will be used as an input path for Step-4 (as an identity mapper):

Step-4: is an identity mapper:

# key: as K
# value as: ("R", a) OR ("S", b)
map(key, value) {
    emit(key, value)
}

Step-4.5: Sort & Shuffle (provided by the MapReduce implementation): will create (key, value) pairs as:

(K, Iterable<(relation, attribute)>)

where K is the common key of R and S, relation is either "R" or "S", and attribute is either some a in A or some b in B.

Step-5: Reducer

# key as K is the common key of R and S
# values : Iterable<(relation, attribute)>
reduce(key, values) {
   # create two lists: one for R and another one for S
   R_list = []
   S_list = []

   # iterate values and update R_list and S_list
   for pair in values {
      relation = pair[0]
      attribute = pair[1]
      if (relation == "R") {
         R_list.append(attribute)
      }
      else { 
         S_list.append(attribute)
        }
    } #end-for

  if (len(R_list) == 0) or (len(S_list) == 0) {
    # no join, no common attributes
    return
  }

  # Both lists are non-empty:
  # len(R_list) > 0 and len(S_list) > 0
  for a in R_list {
     for b in S_list {
         emit (key, (a, b))
     }
  }
} # end-reduce

The left join and right join can be implemented by revising the reducer function.
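As a plain-Python sketch of the reducer logic (not an actual MapReduce runtime), the change for a left join is small: when S_list is empty, emit (a, None) for each a in R_list instead of emitting nothing:

```python
def join_reducer(key, values, how="inner"):
    """values: all (relation, attribute) pairs for one key, as in Step-4.5."""
    r_list = [attr for rel, attr in values if rel == "R"]
    s_list = [attr for rel, attr in values if rel == "S"]
    # left join keeps unmatched rows of R, padded with None
    if how == "left" and r_list and not s_list:
        return [(key, (a, None)) for a in r_list]
    # inner join: cross product of the two lists (empty if either is empty)
    return [(key, (a, b)) for a in r_list for b in s_list]

print(join_reducer("x", [("R", 1), ("R", 2), ("S", 22)]))
# [('x', (1, 22)), ('x', (2, 22))]
print(join_reducer("z", [("R", 5)], how="left"))
# [('z', (5, None))]
print(join_reducer("z", [("R", 5)]))
# []
```

A right join is symmetric: keep unmatched rows of S, padding the R side with None.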

Example: Demo Inner Join

Relation R:

    (x, 1)
    (x, 2)
    (y, 3)
    (y, 4)
    (z, 5)

Relation S:

    (x, 22)
    (x, 33)
    (y, 44)
    (p, 55)
    (p, 66)
    (p, 77)

Step-1: output:

    (x, ("R", 1))
    (x, ("R", 2))
    (y, ("R", 3))
    (y, ("R", 4))
    (z, ("R", 5))

Step-2: output:

    (x, ("S", 22))
    (x, ("S", 33))
    (y, ("S", 44))
    (p, ("S", 55))
    (p, ("S", 66))
    (p, ("S", 77))

Step-3: combine outputs of Step-1 and Step-2:

    (x, ("R", 1))
    (x, ("R", 2))
    (y, ("R", 3))
    (y, ("R", 4))
    (z, ("R", 5))
    (x, ("S", 22))
    (x, ("S", 33))
    (y, ("S", 44))
    (p, ("S", 55))
    (p, ("S", 66))
    (p, ("S", 77))

Step-4: Identity Mapper output:

    (x, ("R", 1))
    (x, ("R", 2))
    (y, ("R", 3))
    (y, ("R", 4))
    (z, ("R", 5))
    (x, ("S", 22))
    (x, ("S", 33))
    (y, ("S", 44))
    (p, ("S", 55))
    (p, ("S", 66))
    (p, ("S", 77))

Step-4.5: Sort & Shuffle output:

(x, [("R", 1), ("R", 2), ("S", 22), ("S", 33)])
(y, [("R", 3), ("R", 4), ("S", 44)])
(z, [("R", 5)])
(p, [("S", 55), ("S", 66), ("S", 77)])

Step-5: Reducer output:

(x, (1, 22))
(x, (1, 33))
(x, (2, 22))
(x, (2, 33))    
(y, (3, 44))
(y, (4, 44))

Join Operation in Spark

Spark has extensive support for the join operation.

Join in RDD

Let A be an RDD[(K, V)] and B be an RDD[(K, U)]; then A.join(B) will return a new RDD (call it C) of type RDD[(K, (V, U))]. Each element of C is a (k, (v, u)) tuple, where (k, v) is in A and (k, u) is in B. Spark performs a hash join across the cluster.

Example:

# sc : SparkContext
x = sc.parallelize([("a", 1), ("b", 4), ("c", 6), ("c", 7)])
y = sc.parallelize([("a", 2), ("a", 3), ("c", 8), ("d", 9)])
x.join(y).collect()
[
 ('a', (1, 2)), 
 ('a', (1, 3)),
 ('c', (6, 8)),
 ('c', (7, 8))
]

Join in DataFrame

# PySpark API:

DataFrame.join(other: pyspark.sql.dataframe.DataFrame,
               on: Union[str, List[str],
                         pyspark.sql.column.Column,
                         List[pyspark.sql.column.Column],
                         None] = None,
               how: Optional[str] = None)
    -> pyspark.sql.dataframe.DataFrame
               
Joins with another DataFrame, using the given join expression.

Example: inner join

# SparkSession available as 'spark'.
>>> emp = [(1, "alex", "100", 33000), \
...        (2, "rose", "200", 44000), \
...        (3, "bob", "100", 61000), \
...        (4, "james", "100", 42000), \
...        (5, "betty", "400", 35000), \
...        (6, "ali", "300", 66000) \
...   ]
>>> emp_columns = ["emp_id", "name", "dept_id", "salary"]
>>> emp_df = spark.createDataFrame(data=emp, schema = emp_columns)
>>>
>>> emp_df.show()
+------+-----+-------+------+
|emp_id| name|dept_id|salary|
+------+-----+-------+------+
|     1| alex|    100| 33000|
|     2| rose|    200| 44000|
|     3|  bob|    100| 61000|
|     4|james|    100| 42000|
|     5|betty|    400| 35000|
|     6|  ali|    300| 66000|
+------+-----+-------+------+

>>> dept = [("Finance", 100), \
...         ("Marketing", 200), \
...         ("Sales", 300), \
...         ("IT", 400) \
...   ]
>>> dept_columns = ["dept_name", "dept_id"]
>>> dept_df = spark.createDataFrame(data=dept, schema = dept_columns)
>>> dept_df.show()
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|  Finance|    100|
|Marketing|    200|
|    Sales|    300|
|       IT|    400|
+---------+-------+

>>> joined = emp_df.join(dept_df, emp_df.dept_id ==  dept_df.dept_id, "inner")
>>> joined.show()
+------+-----+-------+------+---------+-------+
|emp_id| name|dept_id|salary|dept_name|dept_id|
+------+-----+-------+------+---------+-------+
|     1| alex|    100| 33000|  Finance|    100|
|     3|  bob|    100| 61000|  Finance|    100|
|     4|james|    100| 42000|  Finance|    100|
|     2| rose|    200| 44000|Marketing|    200|
|     6|  ali|    300| 66000|    Sales|    300|
|     5|betty|    400| 35000|       IT|    400|
+------+-----+-------+------+---------+-------+

>>> joined = emp_df.join(dept_df, emp_df.dept_id == dept_df.dept_id, "inner") \
...                .drop(dept_df.dept_id)
>>> joined.show()
+------+-----+-------+------+---------+
|emp_id| name|dept_id|salary|dept_name|
+------+-----+-------+------+---------+
|     1| alex|    100| 33000|  Finance|
|     3|  bob|    100| 61000|  Finance|
|     4|james|    100| 42000|  Finance|
|     2| rose|    200| 44000|Marketing|
|     6|  ali|    300| 66000|    Sales|
|     5|betty|    400| 35000|       IT|
+------+-----+-------+------+---------+

Spark Partitioning

A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark. RDDs and DataFrames in Apache Spark are collections of partitions.

Partitioning data (represented as an RDD or DataFrame) in Spark helps achieve more parallelism. For example, if your RDD/DataFrame is partitioned into 100 chunks/partitions, then for RDD.map() there is a chance of running 100 mappers in parallel (at the same time). Therefore, Spark RDDs and DataFrames are stored in partitions and operated on in parallel.

For example, in PySpark you can get the current number of partitions by calling RDD.getNumPartitions().
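The connection between partitions and parallelism can be sketched in plain Python (an analogy only; Spark distributes the partitions over cluster nodes rather than local threads): split the data into chunks and run the same mapper on each chunk independently.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(10))
num_partitions = 2

# split the data into partitions, like an RDD's chunks
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def mapper(partition):
    # the same function is applied to every partition independently
    return [x * x for x in partition]

# because partitions are independent, they can be mapped in parallel
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    results = list(pool.map(mapper, partitions))

flat = sorted(x for part in results for x in part)
print(flat)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```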

Physical Data Partitioning

Physical Data Partitioning is a technique used in data warehouses and big data query engines.

Physical data partitioning is a way to organize very large data sets into several smaller ones based on one or more columns (the partition key; for example, continent, country, date, or state).

The main point of physical data partitioning is to analyze a slice of the data rather than the whole data set. For example, if we have temperature data for 7 continents and we are going to query the data by continent name, then to reduce the query time we can partition the data by continent name: this enables us to query a slice of the data (such as continent_name = asia) rather than the whole data set. When we physically partition data, we create a separate folder (or directory) per value of the partition column.

In Spark, for Physical Data Partitioning, you may use pyspark.sql.DataFrameWriter.partitionBy().

PySpark Example:

output_path = "...target-output-path..."
df.write.partitionBy('continent')\
  .parquet(output_path)

For example, given a DataFrame as:

DataFrame(continent, country, city, temperature)

Then, partitioning by continent, the following physical folders/directories will be created (for example, if we had data for 7 continents, then 7 folders would be created):

<output_path>
      |
      +------- continent=Asia   --- <data-for-asia>
      |
      +------- continent=Europe --- <data-for-europe>
      |
      + ...

For details, refer to Physical Data Partitioning tutorial.
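The folder layout that partitionBy() produces can be simulated in plain Python (a simulation of the directory structure only, not Spark's writer; the rows and the continent=<value> naming follow the example above):

```python
import csv
import os
import tempfile

rows = [
    ("Asia", "Japan", "Tokyo", 15.0),
    ("Asia", "India", "Delhi", 25.0),
    ("Europe", "France", "Paris", 12.0),
]

output_path = tempfile.mkdtemp()

# group rows by the partition column (continent): one folder per value
by_continent = {}
for continent, *rest in rows:
    by_continent.setdefault(continent, []).append(rest)

for continent, group in by_continent.items():
    folder = os.path.join(output_path, f"continent={continent}")
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "part-00000.csv"), "w", newline="") as f:
        csv.writer(f).writerows(group)

print(sorted(os.listdir(output_path)))  # ['continent=Asia', 'continent=Europe']
```

A query for continent=Asia then only has to read the files under that one folder.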

GraphFrames

GraphFrames is an external package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX (included in Spark API) and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

GraphFrames are to DataFrames as GraphX is to RDDs.

To build a graph, you build two DataFrames (one for the vertices and one for the edges) and then glue them together to create a graph:

# each node is identified by "id" and optional attributes
# vertices: DataFrame(id, ...)

# each edge is identified by (src, dst) and optional attributes
# where src and dst are node ids
# edges: DataFrame(src, dst, ...)

# import required GraphFrame library
from graphframes import GraphFrame

# create a new directed graph
graph = GraphFrame(vertices, edges)

Example of a GraphFrame

This example shows how to build a directed graph using graphframes API.

To invoke PySpark with GraphFrames:

% # define the home directory for Spark
% export SPARK_HOME=/home/spark-3.2.0
% # import graphframes library into PySpark and invoke interactive PySpark:
% $SPARK_HOME/bin/pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12
Python 3.8.9 (default, Mar 30 2022, 13:51:17)
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Python version 3.8.9 (default, Mar 30 2022 13:51:17)
Spark context Web UI available at http://10.0.0.234:4040
Spark context available as 'sc' (master = local[*], app id = local-1650670391027).
SparkSession available as 'spark'.
>>>

Then PySpark is ready to use GraphFrames API:

>>># create list of nodes
>>> vert_list = [("a", "Alice", 34),
...              ("b", "Bob", 36),
...              ("c", "Charlie", 30)]
>>>
>>># define column names for a node
>>> column_names_nodes = ["id", "name", "age"]
>>>
>>># create vertices_df as a Spark DataFrame
>>> vertices_df = spark.createDataFrame(
...      vert_list,
...      column_names_nodes
... )

>>>
>>># create list of edges
>>> edge_list = [("a", "b", "friend"),
...              ("b", "c", "follow"),
...              ("c", "b", "follow")]
>>>
>>># define column names for an edge
>>> column_names_edges = ["src", "dst", "relationship"]
>>>
>>># create edges_df as a Spark DataFrame
>>> edges_df = spark.createDataFrame(
...           edge_list,
...           column_names_edges
... )
>>>
>>># import required libriaries
>>> from graphframes import GraphFrame
>>>
>>># build a graph using GraphFrame library
>>> graph = GraphFrame(vertices_df, edges_df)
>>>
>>># examine built graph
>>> graph
GraphFrame(
  v:[id: string, name: string ... 1 more field], 
  e:[src: string, dst: string ... 1 more field]
)
>>>
>>># access vertices of a graph
>>> graph.vertices.show()
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
+---+-------+---+

>>># access edges of a graph
>>> graph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
+---+---+------------+

Advantages of using Spark

GraphX

GraphX is Apache Spark's API (RDD-based) for graphs and graph-parallel computation, with a built-in library of common algorithms. GraphX has APIs for Java and Scala, but does not have a Python API (therefore PySpark does not support GraphX; PySpark supports GraphFrames instead).

Cluster

A cluster is a group of servers on a network that are configured to work together. A server is either a master node or a worker node. A cluster has a master node and many worker nodes. In a nutshell, the master node acts as the cluster manager.

A cluster may have one (or two) master nodes and many worker nodes. For example, a cluster of 15 nodes: one master and 14 worker nodes. Another example: a cluster of 101 nodes: one master and 100 worker nodes.

A cluster may be used for running many jobs (Spark and MapReduce jobs) at the same time.

Master node

In Hadoop, master nodes (a set of one or more nodes) are responsible for storing data in HDFS and overseeing key operations, such as running parallel computations on the data using MapReduce. The worker nodes comprise most of the virtual machines in a Hadoop cluster and perform the job of storing the data and running computations.

Hadoop-Master-Worker: the following image shows Hadoop's master node and worker nodes.

In Spark, the master node contains the driver program, which drives the application by creating the SparkContext object. The SparkContext object works with the cluster manager to manage different jobs. The worker nodes' job is to execute the tasks and return the results to the master node.

Spark-Master-Worker: the following image shows a master node and 2 worker nodes.

Worker node

In Hadoop, the worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the job of storing the data and running computations. Each worker node runs the DataNode and TaskTracker services, which are used to receive the instructions from the master nodes.

In Spark, a worker node is any node that can run application code in the cluster. An executor is a process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk storage across them. Each application has its own executors.

Cluster computing

Cluster computing refers to a collection of tightly or loosely connected computers that work together so that they act as a single entity. The connected computers execute operations together, creating the idea of a single system. The clusters are generally connected through fast local area networks (LANs). A computing cluster comprises one or more master nodes (managers for the whole cluster) and many worker nodes. For example, a cluster may have a single master node (which might not participate in tasks such as mappers and reducers) and 100 worker nodes (which actively carry out tasks such as mappers and reducers). A small cluster might have one master node and 5 worker nodes. Large clusters might have hundreds or thousands of worker nodes.

Concurrency

Concurrency means performing and executing multiple tasks and processes at the same time. Let's define 5 tasks {T1, T2, T3, T4, T5}, each taking 10 seconds. If we execute these 5 tasks in sequence, it will take about 50 seconds, while if we execute all of them in parallel, the whole thing will take about 10 seconds. Cluster computing enables concurrency and parallelism.

Histogram

A graphical representation of the distribution of a set of numeric data, usually as a vertical bar graph.
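Bucketing numeric data for a histogram can be sketched in a few lines of Python (the data and the bucket width are made up for illustration):

```python
from collections import Counter

data = [1, 3, 7, 8, 12, 13, 14, 22]
bucket_width = 10

# map each value to the lower edge of its bucket, then count per bucket
hist = Counter((x // bucket_width) * bucket_width for x in data)
print(sorted(hist.items()))  # [(0, 4), (10, 3), (20, 1)]
```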

Structured data

Structured data — typically categorized as quantitative data — is highly organized and easily decipherable by machine learning algorithms. Developed by IBM in 1974, structured query language (SQL) is the programming language used to manage structured data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data. In structured data, each record has a precise record format. Structured data is identifiable as it is organized in structure like rows and columns.

Unstructured data

In the modern world of big data, unstructured data is the most abundant. It's so prolific because unstructured data can be anything: media, imaging, audio, sensor data, log data, text data, and much more. Unstructured simply means data sets (typically large collections of files) that aren't stored in a structured database format. Unstructured data has an internal structure, but it's not predefined through data models. It might be human-generated or machine-generated, in a textual or non-textual format. Unstructured data is generally regarded as text-heavy, but it may also contain dates, numbers, and facts.

Correlation analysis

The analysis of data to determine the relationship between variables and whether that relationship is negative (-1.00) or positive (+1.00).
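Pearson's correlation coefficient, the quantity behind those -1.00 and +1.00 endpoints, can be computed directly in plain Python (the sample data is made up for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # moves perfectly with xs
zs = [10, 8, 6, 4, 2]   # moves perfectly against xs

print(round(pearson(xs, ys), 2))  # 1.0
print(round(pearson(xs, zs), 2))  # -1.0
```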

Data aggregation tools

Tools for transforming scattered data from numerous sources into a single new source.

Data analyst

Someone who analyzes, models, cleans, or processes data.

Database

A digital collection of data stored via a certain technique. In computing, a database is an organized collection of data (rows or objects) stored and accessed electronically.

Database Management System

The process of collecting, storing, and providing access to data.

Data cleansing

The process of reviewing and revising data in order to delete duplicates, correct errors, and provide consistency.

Data mining

The process of finding patterns or extracting information from data sets.

Data virtualization

A data integration process used to gain more insights. Usually it involves databases, applications, file systems, websites, big data techniques, etc.

De-identification

Same as anonymization: ensuring a person cannot be identified through the data.

ETL (Extract, Transform and Load)

ETL is a process in databases and data warehousing that means extracting data from various sources, transforming it to fit operational needs, and loading it into a database or other storage. For example, processing DNA data, creating output records in a specific Parquet format, and loading them to Amazon S3 is an ETL process.

Failover

Switching automatically to a different server or node should one fail.

Fault-tolerant design

A system designed to continue working even if certain parts fail.

Feature

A piece of measurable information about something; for example, features you might store about a set of people are age, gender, and income.

Graph Databases

Graph databases are purpose-built to store and navigate relationships. Relationships are first-class citizens in graph databases, and most of the value of graph databases is derived from these relationships. Graph databases use nodes to store data entities, and edges to store relationships between entities. An edge always has a start node, end node, type, and direction, and an edge can describe parent-child relationships, actions, ownership, and the like. There is no limit to the number and kind of relationships a node can have.

Grid computing

Connecting different computer systems from various locations, often via the cloud, to reach a common goal.

Key-Value Databases

Key-Value Databases store data with a primary key, a uniquely identifiable record, which makes lookups easy and fast. The value stored is normally some kind of primitive of the programming language. Like a dictionary, for example, Redis allows you to set and retrieve pairs of keys and values. Think of a “key” as a unique identifier (string, integer, etc.) and a “value” as whatever data you want to associate with that key. Values can be strings, integers, floats, booleans, binary, lists, arrays, dates, and more.
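As an illustration, the set/get contract of a key-value store can be sketched with a plain Python dictionary. This is a simplified, hypothetical analogue of what a store such as Redis provides, not a real client (the class name KeyValueStore is illustrative):

```python
class KeyValueStore:
    """A minimal in-memory key-value store (Redis-style sketch)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Associate a value (any Python object) with a unique key.
        self._data[key] = value

    def get(self, key, default=None):
        # Fast lookup by primary key.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.set("user:100", {"name": "Alice", "age": 34})
store.set("counter", 42)
print(store.get("user:100")["name"])  # Alice
print(store.get("counter"))           # 42
```

The key is whatever uniquely identifies the record ("user:100"); the value can be any data structure the program needs to associate with it.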

(key, value)

The (key, value) notation is used in many places, such as Spark and the MapReduce paradigm. In the MapReduce paradigm everything works as a (key, value) pair. Note that the key and value can be simple data types (such as strings or integers) or composite data structures (such as lists or tuples).

In MapReduce, map() and reduce() use (key, value) pairs:

The Map output types should match the input types of the Reduce as shown below:

# mapper can emit 0, 1, 2, ... of (K2, V2)
map(K1, V1) -> { (K2, V2) }

# reducer can emit 0, 1, 2, ... of (K3, V3)
# K2 is a unique key from mapper's outputs
# [V2, ...] are all values associated with key K2
reduce(K2, [V2, ...]) -> { (K3, V3) }

In Spark, using RDDs, a source RDD must be in (key, value) form before we can apply reduction transformations such as groupByKey(), reduceByKey(), and combineByKey().
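The contract of a reduction such as reduceByKey() can be sketched in plain Python, without Spark: group the (key, value) pairs by key, then fold each group's values with a binary function. Spark applies the same idea in parallel across partitions; the helper name reduce_by_key below is illustrative:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    # Group values by key, then fold each group with func,
    # mimicking the contract of Spark's RDD.reduceByKey().
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 5)]
result = reduce_by_key(pairs, lambda x, y: x + y)
print(result)  # {'A': 4, 'B': 6, 'C': 5}
```

Note that the folding function must be associative and commutative for the result to be well defined when partitions are reduced in parallel.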

Java

Java is a programming language and computing platform first released by Sun Microsystems in 1995. It has evolved from humble beginnings to power a large share of today’s digital world, by providing the reliable platform upon which many services and applications are built. New, innovative products and digital services designed for the future continue to rely on Java, as well.

Python

Python is a programming language that lets you work quickly and integrate systems more effectively. Python is an interpreted, largely object-oriented programming language that has gained popularity among big data professionals due to its readability and clarity of syntax. Python is relatively easy to learn and highly portable, as it runs on many operating systems.

JavaScript

A scripting language designed in the mid-1990s for embedding logic in web pages, but which later evolved into a more general-purpose development language.

In-memory

A database management system that stores data in main memory instead of on disk, resulting in very fast processing, storing, and loading of the data.

Internet of Things

Ordinary devices that are connected to the internet at any time, anywhere, via sensors.

Latency

A measure of time delayed in a system

Location data

GPS data describing a geographical location

Machine Learning

Part of artificial intelligence where machines learn from what they are doing and become better over time. Apache Spark offers a comprehensive Machine Learning library (MLlib) for big data. In a nutshell, machine learning is an application of AI that enables systems to learn and improve from experience without being explicitly programmed.

There are many ML packages for experimentation, such as scikit-learn, TensorFlow, PyTorch, and Spark MLlib.

Metadata

Data about data; gives information about what the data is about.

For example, author, date created, date modified and file size are examples of very basic document file metadata.

Table definition for a relational table is an example of metadata.
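As a quick illustration, a file's basic metadata (its size and last-modified time) can be read with Python's standard os.stat():

```python
import os
import tempfile

# Write a small file, then read its metadata (data about the data).
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("hello metadata")
    path = f.name

info = os.stat(path)
print(info.st_size)       # file size in bytes (14)
print(info.st_mtime > 0)  # last-modified timestamp exists (True)
os.remove(path)
```

Here the file's content is the data; its size, timestamps, and name are the metadata describing it.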

Natural Language Processing (NLP)

A field of computer science involved with interactions between computers and human languages.

Open-source software for NLP includes the Stanford Natural Language Processing toolkit (Stanford CoreNLP).

Network analysis

Viewing relationships among the nodes of a network in terms of network or graph theory, meaning analysing the connections between nodes and the strength of those ties.

Workflow

A graphical representation of a set of events, tasks, and decisions that define a business process (example: vacation approval process in a company; purchase approval process). You use the developer tool to add objects to a workflow and to connect the objects with sequence flows. The Data Integration Service uses the instructions configured in the workflow to run the objects.

Schema

In computer programming, a schema (pronounced SKEE-mah) is the organization or structure for a database, while in artificial intelligence (AI) a schema is a formal expression of an inference rule. For the former, the activity of data modeling leads to a schema.

Difference between Tuple and List in Python

The primary difference between tuples and lists is that tuples are immutable while lists are mutable: a list can be changed after it is created, but the contents of a tuple cannot change once the tuple has been created.

Examples in Python3:

# create a tuple
>>> t3 = (10, 20, 40)
>>> t3
(10, 20, 40)

# create a list
>>> l3 = [10, 20, 40]
>>> l3
[10, 20, 40]

# add an element to a list
>>> l3.append(500)
>>> l3
[10, 20, 40, 500]

# add an element to a tuple
>>> t3.append(600)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

Object Databases

An object database stores data in the form of objects, as used by object-oriented programming. Object databases are different from relational or graph databases, and most of them offer a query language that allows objects to be found with a declarative programming approach.

Pattern Recognition

Pattern Recognition identifies patterns in data via algorithms to make predictions of new data coming from the same source.

Predictive analysis

Analysis within big data to help predict how someone will behave in the (near) future. It uses a variety of different data sets such as historical, transactional, or social profile data to identify risks and opportunities.

Privacy

To seclude certain data/information about oneself that is deemed personal.

Public data

Public information or data sets that were created with public funding.

Query

Asking for information to answer a certain question

Regression analysis

To define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.
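As a minimal illustration of regression, a one-variable least-squares line can be computed in pure Python; real analyses would typically use a library such as scikit-learn or Spark MLlib (the helper name linear_regression is illustrative):

```python
def linear_regression(xs, ys):
    # Fit y = slope * x + intercept by ordinary least squares.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The response ys depends on xs exactly as y = 2x + 1 in this toy data set.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The fitted slope quantifies the one-way dependency of the response variable on the explanatory variable.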

Real-time data

Data that is created, processed, stored, analysed and visualized within milliseconds

Scripting

The use of a computer language where your program, or script, can be run directly with no need to first compile it to binary code.

Semi-structured data

A form of data that does not have the formal structure of structured data. It does, however, have tags or other markers to enforce a hierarchy of records.

Sentiment Analysis

Using algorithms to find out how people feel about certain topics or events

SQL

A programming language for retrieving data from a relational database. SQL can also be used to retrieve data from big data systems by translating queries into mappers, filters, and reducers.

SQL is a standard language for accessing and manipulating databases (relational and non-relational).

What is SQL?

SQL stands for Structured Query Language; it is an ANSI/ISO standard language for accessing and manipulating databases.

What Can SQL do?

SQL can execute queries against a database; retrieve, insert, update, and delete records; and create new databases, tables, and views.
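A short, self-contained illustration of common SQL statements (CREATE, INSERT, and SELECT with GROUP BY), using Python's built-in sqlite3 module and a made-up employees table:

```python
import sqlite3

# In-memory SQLite database: create a table, insert rows, query them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Alice", "Engineering"), ("Bob", "Sales"), ("Carol", "Engineering")],
)
# Aggregate query: count employees per department.
cur.execute("SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept")
rows = cur.fetchall()
print(rows)  # [('Engineering', 2), ('Sales', 1)]
conn.close()
```

The same statements would work, with minor dialect differences, against most relational databases.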

Time series analysis

Analysing well-defined data obtained through repeated measurements of time. The data has to be well defined and measured at successive points in time spaced at identical time intervals.

Variability

It means that the meaning of the data can change (rapidly). In two almost identical tweets, for example, the same word can have a totally different meaning.

What are the 4 Vs of Big Data?

Variety

Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data

Velocity

The speed at which the data is created, stored, analysed and visualized

Veracity

Ensuring that the data is correct as well as the analyses performed on the data are correct.

Volume

The amount of data, ranging from megabytes to gigabytes to petabytes to ...

XML Databases

XML Databases allow data to be stored in XML format. The data stored in an XML database can be queried, exported and serialized into any format needed.

Big Data Scientist

Someone who is able to develop the distributed algorithms to make sense out of big data

Classification analysis

A systematic process for obtaining important and relevant information about data, also called metadata: data about data.

Cloud computing

A distributed computing system over a network used for storing data off-premises. This can include ETL, data storage, application development, and data analytics. Examples: Amazon Cloud and Google Cloud.

Cloud computing is one of the must-know big data terms. It is a computing paradigm that offers virtualization of computing resources running on standard remote servers for storing data, and provides IaaS, PaaS, and SaaS. Cloud computing provides IT resources such as infrastructure, software, platform, database, and storage as services. Flexible scaling, rapid elasticity, resource pooling, and on-demand self-service are some of its characteristics.

Clustering analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Database-as-a-Service

A database hosted in the cloud on a pay per use basis, for example Amazon Web Services

Database Management System (DBMS)

Database Management System is software that collects data and provides access to it in an organized layout. It creates and manages the database. DBMS provides programmers and users a well-organized process to create, update, retrieve, and manage data.

Distributed File System

Systems that offer simplified, highly available access to storing, analysing, and processing data; examples are HDFS (Hadoop Distributed File System) and Amazon S3.

Document Store Databases

A document-oriented database that is especially designed to store, manage, and retrieve documents, also known as semi-structured data.

NoSQL

NoSQL, sometimes read as ‘Not only SQL', is an approach to database design that does not adhere to traditional relational database structures, in which data is placed in tables and the data schema is carefully designed before the database is built. By relaxing some relational guarantees, NoSQL databases can achieve higher availability and horizontal scaling. NoSQL can accommodate a wide variety of data models, including key-value, document, columnar, and graph formats, and NoSQL databases are especially useful for working with large sets of distributed data.

Scala

A software programming language that blends object-oriented methods with functional programming capabilities. This allows it to support a more concise programming style which reduces the amount of code that developers need to write. Another benefit is that Scala features, which operate well in smaller programs, also scale up effectively when introduced into more complex environments.

Columnar Database

A database that stores data column by column instead of row by row is known as a column-oriented (columnar) database.

Data Analyst

The data analyst is responsible for collecting, processing, and performing statistical analysis of data, and discovers how this data can be used to help the organization make better business decisions. It is one of the big data terms that defines a big data career. Data analysts work with end business users to define the types of analytical reports required by the business.

Data Scientist

Data Scientist is also a big data term that defines a big data career. A data scientist is a practitioner of data science, proficient in mathematics, statistics, computer science, and/or data visualization, who establishes data models and algorithms to solve complex problems.

Data Model and Data Modelling

A data model is the starting phase of database design and usually consists of attributes, entity types, integrity rules, relationships, and definitions of objects.

Data modeling is the process of creating a data model for an information system by using certain formal techniques. Data modeling is used to define and analyze the requirement of data for supporting business processes.

Hive

Hive is an open-source, Hadoop-based data warehouse software project providing data summarization, analysis, and query. Users can write queries in the SQL-like language known as HiveQL. (Hadoop is a framework that handles large datasets in a distributed computing environment.)

Load Balancing

Load balancing distributes workload across two or more computers or servers over a network so that work is completed in less time and all users are served faster. It is the main reason for computer server clustering, and it can be implemented with software, hardware, or a combination of both.

Log File

A log file is a file automatically created by a computer program (for example, the operating system or any running software) to record events that occur while it is operational.

Parallel Processing

It is the capability of a system to perform the execution of multiple tasks simultaneously (at the same time)

In parallel processing, we take in multiple different forms of information at the same time. This is especially important in vision. For example, when you see a bus coming towards you, you see its color, shape, depth, and motion all at once.

Parallel processing is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors will help reduce the amount of time to run a program.

For example, Spark uses Resilient Distributed Datasets (RDD) to perform parallel processing across a cluster or computer processors.
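As a small local sketch of the idea (threads standing in for cluster workers), a task can be split into independent parts and run concurrently with Python's standard library; true CPU parallelism would use multiple processes or, at scale, a cluster engine such as Spark:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # One independent unit of work; real workloads would do heavy computation.
    return n * n

numbers = list(range(1, 9))

# Split the overall task into independent parts and run them concurrently,
# the same idea Spark applies by partitioning an RDD across worker nodes.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, numbers))

print(results)  # [1, 4, 9, 16, 25, 36, 49, 64]
```

Because each unit of work is independent, the parts can run in any order; map() still returns the results in the original input order.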

Server (or node)

A server is a virtual or physical computer that receives requests related to a software application over a network and serves responses to those requests. It is a common big data term, used in almost all big data technologies.

Abstraction layer

A translation layer that transforms high-level requests into low-level functions and actions. Data abstraction hides the complex, unnecessary details of the system from the client and presents a simplified representation containing only the essential details needed to perform a function. A typical example of an abstraction layer is an API (application programming interface) between an application and an operating system.

For example, Spark offers three data abstractions, meaning that your data can be represented as an RDD, a DataFrame, or a Dataset.

Cloud

Cloud technology, or The Cloud as it is often referred to, is a network of servers that users access via the internet and the applications and software that run on those servers. Cloud computing has removed the need for companies to manage physical data servers or run software applications on their own devices - meaning that users can now access files from almost any location or device.

The cloud is made possible through virtualisation: a technology that mimics a physical server in virtual, digital form, also known as a virtual machine.

Data ingestion

Data ingestion is the process of moving data from various sources into a central repository such as a data warehouse where it can be stored, accessed, analysed, and used by an organisation.

Common examples of data ingestion include loading log files, copying records from operational databases, and streaming events from applications into a central repository.

Data warehouse

A centralised repository of information that enterprises can use to support business intelligence (BI) activities such as analytics. Data warehouses typically integrate historical data from various sources.

For big data, data warehouse platforms on the market include Amazon Redshift, Google BigQuery, and Snowflake.

Open-source

Open-source refers to the availability of certain types of code to be used, redistributed and even modified for free by other developers. This decentralised software development model encourages collaboration and peer production.

The most popular open-source software is from Apache.

Prime examples of open-source products are Apache Hadoop and Apache Spark.

Relational database

The relational term here refers to the relations (also commonly referred to as tables) in the database - the tables and their relationships to each other. The tables 'relate' to each other. It is these relations (tables) and their relationships that make it relational.

A relational database exists to house and identify data items that have pre-defined relationships with one another. Relational databases can be used to gain insights into data in relation to other data via sets of tables with columns and rows. In a relational database, each row in the table has a unique ID referred to as a key.

In short, a relational database is a collection of information that organizes data in predefined relationships, stored in one or more tables (or "relations") of columns and rows, making it easy to see and understand how different data structures relate to each other.

There are 3 different types of relationships in a database: one-to-one, one-to-many, and many-to-many.

RDBMS

A Relational Database Management System (RDBMS) is a database management system that stores data in relational tables of rows and columns; examples include MySQL, PostgreSQL, and Oracle Database.

How does Hadoop perform Input Splits?

Hadoop's InputFormat<K, V> is responsible for providing the input splits. InputFormat<K, V> describes the input specification for a MapReduce job. The interface's fully qualified name is
org.apache.hadoop.mapred.InputFormat<K,V>.

According to Hadoop: the MapReduce framework relies on the InputFormat of the job to:

  1. Validate the input-specification of the job.

  2. Split-up the input file(s) into logical InputSplit(s), each of which is then assigned to an individual Mapper.

  3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The InputFormat interface has 2 functions:

// 1. Get the RecordReader for the given InputSplit.
RecordReader<K,V>   getRecordReader(InputSplit split, JobConf job, Reporter reporter)

// 2. Logically split the set of input files for the job.
InputSplit[]    getSplits(JobConf job, int numSplits)

In general, if you have N nodes, HDFS will distribute the input file(s) over these N nodes. When you start a job, the framework creates one mapper per input split, and each mapper, where possible, processes the part of the data stored on its own node.

MapReduce/Hadoop data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks.

The number of maps is usually driven by the number of DFS blocks in the input files. Each of these mapper tasks is assigned, where possible, to a worker node where the input split (InputSplit) is stored. The Resource Manager does its best to ensure that input splits are processed locally (for optimization purposes).

Sort & Shuffle function in MapReduce/Hadoop

Shuffle phase in Hadoop transfers the map output (in the form of (key, value) pairs) from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers the merging and sorting of mappers outputs. Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key.

For example, suppose there were 3 input chunks/splits (each chunk going to a different server) and the mappers created the following (key, value) pairs per split (also called partitions); consider all of the output from all of the mappers:

    Partition-1  Partition-2  Partition-3
    (A, 1)       (A, 5)       (A, 9)
    (A, 3)       (B, 6)       (C, 20)
    (B, 4)       (C, 10)      (C, 30)
    (B, 7)       (D, 50)
    (A, 100)

Then the output of the Sort & Shuffle phase will be (note that the values for each key are not in any particular order):

    (A, [1, 3, 9, 5, 100])
    (B, [4, 7, 6])
    (C, [10, 20, 30])
    (D, [50])

Output of Sort & Shuffle phase will be input to reducers.

Therefore, Sort & Shuffle creates its outputs in the following form:

(key, [v_1, v_2, ..., v_n])

where all mappers have created:

(key, v_1),
(key, v_2),
...
(key, v_n)
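The Sort & Shuffle phase above can be simulated in plain Python: merge the mapper outputs from all partitions and group the values by key. This is only a sketch of the phase's contract (the function name sort_and_shuffle is illustrative), not Hadoop's actual implementation:

```python
from collections import defaultdict

# Mapper outputs from the three partitions in the example above.
partition_1 = [("A", 1), ("A", 3), ("B", 4), ("B", 7), ("A", 100)]
partition_2 = [("A", 5), ("B", 6), ("C", 10), ("D", 50)]
partition_3 = [("A", 9), ("C", 20), ("C", 30)]

def sort_and_shuffle(*partitions):
    # Group every (key, value) pair by key across all partitions,
    # producing (key, [v_1, v_2, ..., v_n]) for the reducers.
    grouped = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            grouped[key].append(value)
    # Keys are delivered in sorted order; the value order is not guaranteed.
    return sorted(grouped.items())

for key, values in sort_and_shuffle(partition_1, partition_2, partition_3):
    print(key, values)
# A [1, 3, 100, 5, 9]   (same values as the example; order may differ)
# B [4, 7, 6]
# C [10, 20, 30]
# D [50]
```

Each reducer then receives one of these (key, list-of-values) pairs as its input.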

NoSQL Database

NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables. NoSQL databases come in a variety of types; Redis, HBase, CouchDB, and MongoDB are examples of NoSQL databases.

References

  1. Data Algorithms with Spark by Mahmoud Parsian

  2. Data Algorithms by Mahmoud Parsian

  3. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms by Jimmy Lin

  4. Google’s MapReduce Programming Model — Revisited by Ralf Lammel

  5. MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat

  6. Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer

  7. MapReduce Design Patterns by Donald Miner, Adam Shook

  8. Hadoop: The Definitive Guide, 4th Edition by Tom White

  9. Learning Spark, 2nd Edition by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

  10. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman

  11. Chapter 2, MapReduce and the New Software Stack by Jeff Ullman

  12. A Very Brief Introduction to MapReduce by Diana MacLean

  13. Apache Hadoop MapReduce Tutorial, 2022-07-29

  14. Big Data Glossary by Pete Warden, 2011, O'Reilly

  15. What is Lineage In Spark?

  16. RDD lineage in Spark: ToDebugString Method

  17. Lazy Evaluation in Apache Spark

  18. Advanced Analytics with PySpark by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

  19. 8 Steps for a Developer to Learn Apache Spark with Delta Lake by Databricks

  20. Apache Spark Key Terms, Explained

  21. How Data Partitioning in Spark helps achieve more parallelism?

  22. Apache Spark Officially Sets a New Record in Large-Scale Sorting

  23. Understanding Parquet and its Optimization Opportunities

  24. Columnar Vs. Row Oriented Databases - The Basics

  25. What is Parquet by Databricks

  26. Spark File Format Showdown – CSV vs JSON vs Parquet by Garren Staubli

  27. Introduction to SQL

  28. What are the 4 Vs of Big Data? by Bernard Marr